Skip to content

Job Launcher and Job Handle: design doc/implementation/unit tests (TA: NVFlare developers)#4336

Open
IsaacYangSLA wants to merge 3 commits intoNVIDIA:mainfrom
IsaacYangSLA:add_k8s_job_launcher_and_design_docs
Open

Job Launcher and Job Handle: design doc/implementation/unit tests (TA: NVFlare developers)#4336
IsaacYangSLA wants to merge 3 commits intoNVIDIA:mainfrom
IsaacYangSLA:add_k8s_job_launcher_and_design_docs

Conversation

@IsaacYangSLA
Copy link
Collaborator

Description

Job Launcher and Job Handle design docs
Implementation of Job Launcher and Job Handle for kubernetes (docker is WIP)
Unit tests for implementation

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

Copilot AI review requested due to automatic review settings March 19, 2026 00:04
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 19, 2026

Greptile Summary

This PR introduces JobLauncherSpec / JobHandleSpec abstraction interfaces, a Kubernetes-backed implementation (K8sJobLauncher / K8sJobHandle), two new Best-Effort resource management classes (BEResourceManager, BEResourceConsumer), an ignore_host flag for GPUResourceManager to support K8s deployments, a comprehensive unit-test suite, and a design document.

Key changes:

  • nvflare/apis/job_launcher_spec.py: Clean new spec layer defining JobHandleSpec, JobLauncherSpec, JobProcessArgs, and JobReturnCode. Well-structured and consistent with existing NVFlare component conventions.
  • nvflare/app_opt/job_launcher/k8s_launcher.py: Full K8s launcher. Several bugs flagged in earlier review rounds have been fixed in this version — None module-arg values are now skipped when building pod container args, terminate() sets terminal_state in all code paths, wait() persists the terminal state so subsequent poll() calls remain accurate, and the return value of enter_states() is now checked in launch_job(). One remaining suggestion: emit a logger.warning when a data-PVC YAML file contains more than one entry, since currently only the first key is used and extras are silently discarded.
  • Resource managers: BEResourceManager and BEResourceConsumer are minimal, correct best-effort implementations. The new ignore_host parameter in GPUResourceManager cleanly addresses the split-plane K8s topology.
  • Tests: Coverage is thorough across UUID conversion, pod-state/return-code mapping, full manifest construction, stuck-detection, wait/poll/terminate lifecycle, and launcher initialization. Two minor test-quality issues in configer_test.py: an assertion inside the for-loop in test_merge_configs exits on the first mismatch rather than reporting all failures, and test_add_and_remove_config_keys uses list(file_merged.items())[0] which assumes dict insertion order for its correctness.

Confidence Score: 4/5

  • This PR is safe to merge with minor style improvements recommended.
  • The implementation is well-structured and the majority of bugs identified in prior review rounds have been addressed in the current HEAD. The remaining comments are style-level (silent PVC key selection, hardcoded poll interval, fragile test assertions) rather than correctness bugs. The unit test coverage is comprehensive and the design doc matches the implementation.
  • nvflare/app_opt/job_launcher/k8s_launcher.py (data PVC silent first-key selection), tests/unit_test/tool/job/config/configer_test.py (fragile test assertions)

Important Files Changed

Filename Overview
nvflare/app_opt/job_launcher/k8s_launcher.py Core Kubernetes launcher and handle implementation. Several issues flagged in prior review rounds are now resolved (None args in pod spec skipped, terminal_state set in all terminate() paths, wait() persists final state, enter_states return value checked). One style concern: silent first-key selection when data PVC file has multiple entries.
nvflare/apis/job_launcher_spec.py New spec interfaces JobHandleSpec and JobLauncherSpec plus JobProcessArgs constants and add_launcher helper. Clean, well-structured abstract layer; no issues found.
nvflare/app_common/resource_managers/BE_resource_manager.py New BestEffort resource manager that always accepts resource allocation requests. Correctly implements the ResourceManagerSpec interface; design intent (let jobs fail at runtime rather than at scheduling) is documented.
nvflare/app_common/resource_consumers/BE_resource_consumer.py Trivial BestEffort resource consumer with a no-op consume() method. Correct and complete.
nvflare/app_common/resource_managers/gpu_resource_manager.py Adds ignore_host parameter to GPUResourceManager to support Kubernetes deployments where the NVFlare control plane runs on a non-GPU node. Logic and validation are correct.
tests/unit_test/app_opt/job_launcher/k8s_launcher_test.py Comprehensive unit test suite covering uuid conversion, pod-state mapping, manifest building, stuck detection, wait/poll/terminate lifecycle, and launcher initialization. Minor style issues: hardcoded sleep values not mocked in wait() tests, and test_max_stuck_count_uses_pending_timeout_when_no_timeout relies on implicit default value of pending_timeout.
tests/unit_test/app_opt/job_launcher/init.py Empty init.py to make the test directory a Python package. Correct and necessary.
tests/unit_test/tool/job/config/configer_test.py Adds test_add_and_remove_config_keys and test_split_key to the configer test suite. Minor issues: assertion inside the for-loop in test_merge_configs exits early on first failure, and index-zero assumption in test_add_and_remove_config_keys is fragile if dict order changes.
docs/design/JobLauncher_and_JobHandle.md New design document covering JobLauncherSpec/JobHandleSpec architecture, Kubernetes implementation details, and the event-based launcher selection mechanism. Well-written and matches the implementation.

Sequence Diagram

sequenceDiagram
    participant UL as Upper Layer<br/>(ServerEngine/ClientExecutor)
    participant EH as Event System
    participant KL as K8sJobLauncher
    participant KH as K8sJobHandle
    participant K8s as Kubernetes API

    UL->>EH: fire BEFORE_JOB_LAUNCH event
    EH->>KL: handle_event(BEFORE_JOB_LAUNCH, fl_ctx)
    KL->>KL: extract_job_image(job_meta, site_name)
    alt job_image present
        KL->>EH: add_launcher(self, fl_ctx)
    end

    UL->>KL: launch_job(job_meta, fl_ctx)
    KL->>KL: uuid4_to_rfc1123(raw_job_id)
    KL->>KL: get_module_args(job_id, fl_ctx)
    KL->>KH: K8sJobHandle(job_id, core_v1, job_config)
    KH->>KH: _make_manifest(job_config)
    KH-->>KL: job_handle

    KL->>K8s: create_namespaced_pod(pod_manifest)
    alt pod creation fails
        K8s-->>KL: Exception
        KL->>KH: terminal_state = TERMINATED
        KL-->>UL: job_handle (terminated)
    else pod created
        K8s-->>KL: ok
        KL->>KH: enter_states([RUNNING])
        loop poll until RUNNING or timeout
            KH->>K8s: read_namespaced_pod(job_id)
            K8s-->>KH: pod phase
            alt stuck in PENDING
                KH->>K8s: delete_namespaced_pod(job_id)
                KH->>KH: terminal_state = TERMINATED
            end
        end
        KH-->>KL: True/False
        KL-->>UL: job_handle
    end

    UL->>KH: wait()
    loop until terminal
        KH->>K8s: read_namespaced_pod(job_id)
        K8s-->>KH: phase (SUCCEEDED/FAILED/RUNNING)
        alt SUCCEEDED or FAILED
            KH->>KH: terminal_state = job_state
        end
    end

    UL->>KH: poll()
    KH-->>UL: JobReturnCode (SUCCESS/ABORTED/UNKNOWN)
Loading

Last reviewed commit: "Fix reference error"

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds documentation, Kubernetes-oriented job launching support, and unit tests around the new JobLauncher/JobHandle abstractions, plus “best effort” resource management utilities intended for orchestration backends.

Changes:

  • Add detailed design documentation for JobLauncherSpec/JobHandleSpec and the Process/Docker/K8s implementations.
  • Extend the Kubernetes launcher/handle to build PVC-backed pod manifests, support GPU limits, and add timeout/stuck-handling logic.
  • Introduce BE (best-effort) resource manager/consumer utilities and expand GPUResourceManager with an ignore_host option.
  • Add unit tests for K8s and Docker job launchers/handles.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
docs/design/JobLauncher_and_JobHandle.md New design doc describing the launcher/handle architecture and backends.
nvflare/apis/job_launcher_spec.py Docstring updates to clarify expected return values.
nvflare/app_opt/job_launcher/k8s_launcher.py Major updates to K8s pod manifest creation, lifecycle handling, and resource/PVC wiring.
nvflare/private/fed/client/communicator.py Adds configurable timeout for waiting on cell creation during client registration.
nvflare/app_common/resource_managers/gpu_resource_manager.py Adds ignore_host option to skip host GPU validation.
nvflare/app_common/resource_managers/BE_resource_manager.py New “best effort” resource manager implementation.
nvflare/app_common/resource_consumers/BE_resource_consumer.py New no-op resource consumer.
tests/unit_test/app_opt/job_launcher/k8s_launcher_test.py New unit tests for K8s job handle/launcher behavior.
tests/unit_test/app_opt/job_launcher/docker_launcher_test.py New unit tests for Docker job handle/launcher behavior.
tests/unit_test/app_opt/job_launcher/__init__.py Test package init.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@IsaacYangSLA IsaacYangSLA force-pushed the add_k8s_job_launcher_and_design_docs branch from 8607f8a to 5ac3bf1 Compare March 19, 2026 00:48
@IsaacYangSLA IsaacYangSLA force-pushed the add_k8s_job_launcher_and_design_docs branch from 5ac3bf1 to 7eb5b0e Compare March 19, 2026 01:50
@IsaacYangSLA IsaacYangSLA force-pushed the add_k8s_job_launcher_and_design_docs branch from 7eb5b0e to 1048b0b Compare March 19, 2026 02:05
@IsaacYangSLA IsaacYangSLA force-pushed the add_k8s_job_launcher_and_design_docs branch 2 times, most recently from de9d011 to 9708848 Compare March 19, 2026 03:09
@IsaacYangSLA IsaacYangSLA force-pushed the add_k8s_job_launcher_and_design_docs branch from 9708848 to 4138a23 Compare March 19, 2026 03:21
Job Launcher for K8s environement
GPU, image, pvc updated and working
Add codes

Add unit tests
@IsaacYangSLA IsaacYangSLA force-pushed the add_k8s_job_launcher_and_design_docs branch from 4138a23 to fb21e24 Compare March 19, 2026 03:36
@IsaacYangSLA IsaacYangSLA force-pushed the add_k8s_job_launcher_and_design_docs branch 12 times, most recently from 102c57e to a1f1228 Compare March 19, 2026 20:22
@IsaacYangSLA IsaacYangSLA force-pushed the add_k8s_job_launcher_and_design_docs branch 8 times, most recently from 34fee96 to 59124fb Compare March 19, 2026 22:08
@IsaacYangSLA
Copy link
Collaborator Author

/build

@IsaacYangSLA IsaacYangSLA force-pushed the add_k8s_job_launcher_and_design_docs branch 13 times, most recently from a21c83d to 27a8b46 Compare March 20, 2026 01:55
Address comments
@IsaacYangSLA IsaacYangSLA force-pushed the add_k8s_job_launcher_and_design_docs branch from 27a8b46 to 2d9eb0f Compare March 20, 2026 02:09
@IsaacYangSLA
Copy link
Collaborator Author

/build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants